Unsupervised Learning on Breast Tumor Biopsieslibrary(MASS)
data("biopsy")
The biopsy from MASS package was used for this project.
Reading biopsy dataset from MASS package
The clustering functions (e.g., prcomp, hclust, and kmeans) used in this project are intrinsic to R-Base package. Other packages were needed for managing, and visualizing data and results.
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(gridExtra)
library(grid)
library(cluster)
library(DT)
library(ggplot2)
library(plotly)
library(gapminder)
library(purrr)
library(repurrrsive)
library(tibble)
library(dplyr)
library(tidyr)
library(reshape)
library(ggpmisc)
library(naniar)
library(fpc)
library(cluster)
library(factoextra)
library(fpc)
library(NbClust)
datatable(biopsy, filter = "top", options = list(pageLenght = 5, scrollX=T))
>Data source: ‘biopsy’ data documentation
Data columns
V1: clump thicknessV2: uniformity of cell size.V3: uniformity of cell shape.V4: marginal adhesion.V5: single epithelial cell size.V6: bare nuclei (16 values are missing).V7: bland chromatin.V8: normal nucleoli.V9: mitoses.class: “benign” or “malignant”.Data obtained from the University of Wisconsin Hospitals, Madison (Dr. Wolberg). It is based on assessment of bx of breast tumours for 699 patients. Each of nine attributes V1-V9 is scored on a scale of 1 to 10. The outcome class is also known i.e., benign / malignant.
To cluster the biopsy samples based on features of histopathological slides into two distinct groups, that will correlate with clinical diagnosis (i.e., benign or malignant tumor).
Whether the membership of the cluster, i.e., the samples within the cluster, are distinctively benign or malignant.
In clustering - each cluster is distinct from each other cluster - objects within each cluster are broadly similar to each other.
To use several unsupervised machine learning algorithms such as Principal Component Analysis, Hierarchical Clustering, and K-Means Clustering for building the unsupervised machine learning model and compare their clustering accuracies based on known diagnosis of the tumors.
dim(biopsy)
## [1] 699 11
str(biopsy)
## 'data.frame': 699 obs. of 11 variables:
## $ ID : chr "1000025" "1002945" "1015425" "1016277" ...
## $ V1 : int 5 5 3 6 4 8 1 2 2 4 ...
## $ V2 : int 1 4 1 8 1 10 1 1 1 2 ...
## $ V3 : int 1 4 1 8 1 10 1 2 1 1 ...
## $ V4 : int 1 5 1 1 3 8 1 1 1 1 ...
## $ V5 : int 2 7 2 3 2 7 2 2 2 2 ...
## $ V6 : int 1 10 2 4 1 10 10 1 1 1 ...
## $ V7 : int 3 3 3 3 3 9 3 3 1 2 ...
## $ V8 : int 1 2 1 7 1 7 1 1 1 1 ...
## $ V9 : int 1 1 1 1 1 1 1 1 5 1 ...
## $ class: Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
The original data has 699 rows and 11 columns.
# check if any missing values in the data
anyNA(biopsy)
## [1] TRUE
# list rows of data that have missing values
grid.table(biopsy[!complete.cases(biopsy),])
# Visualizing missing values in the original data
par(mfrow=c(1,3))
vis_miss(biopsy) + theme_minimal()
gg_miss_var(biopsy, facet = class) + theme_minimal()
# create a subset of complete dataset without missing values
biopsy1 <- na.exclude(biopsy)
dim(biopsy1)
## [1] 683 11
A total of 16 missing values; all for
V6(i.e., missing not at random). These observations were deleted before applying clustering algorithms. The complete data has 683 rows and 11 columns.
table(biopsy$class)
##
## benign malignant
## 458 241
# Assigning a numeric value to pathological diagnosis based on features on the complete dataset
diagnosis <- as.numeric(biopsy1$class == "benign")
table(biopsy1$class)
##
## benign malignant
## 444 239
table(diagnosis)
## diagnosis
## 0 1
## 239 444
The last-column i.e., class specifies the specific diagnosis of the tumors. This variable will be used to assess the accuracy of clustering.
biopsy2 <- as.matrix(biopsy1[, 2:10])
str(biopsy2)
## int [1:683, 1:9] 5 5 3 6 4 8 1 2 2 4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:683] "1" "2" "3" "4" ...
## ..$ : chr [1:9] "V1" "V2" "V3" "V4" ...
head(biopsy2)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 5 1 1 1 2 1 3 1 1
## 2 5 4 4 5 7 10 3 2 1
## 3 3 1 1 1 2 2 3 1 1
## 4 6 8 8 1 3 4 3 7 1
## 5 4 1 1 3 2 1 3 1 1
## 6 8 10 10 8 7 10 9 7 1
row.names(biopsy2) <- biopsy1$ID
datatable(biopsy2, filter = "top", options = list(pageLenght = 5, scrollX=T))
#head(biopsy2)
Creating a data matrix of the attributes (numeric). Unsupervised learning methods will be applied on this matrix.
mdata <- melt(biopsy, id=c("ID","class"))
p <- ggplot(data = mdata, aes(x=variable, y=value)) + geom_boxplot(aes(fill=class))
p + facet_wrap( ~ variable, scales="free")
Box-plots of Attributes V1-V9
Malignant tumors have higher values, on the scale of 1 to 10, for all the features compared to benign tumors.
V1-V9) than benign tumors.
# Perform T-Test on Attributes Between Groups
multi.tests <- function(fun = t.test, df, vars, group.var, ...) {
sapply(simplify = FALSE,
vars,
function(var) {
formula <- as.formula(paste(var, "~", group.var))
fun(data = df, formula, ...)
})}
res.multi.t.tests <- multi.tests(fun = t.test,
df = biopsy,
vars = c(names(biopsy[c(2:10)])),
group.var = "class",
var.equal = TRUE, na.rm=TRUE)
p <-data.frame(p.value = sapply(res.multi.t.tests, getElement, name = "p.value"))
p.value <-format.pval(pv = p, digits = 2, eps = 0.001, nsmall = 3)
et <- t(data.frame(estimate = round(sapply(res.multi.t.tests, getElement, name = "estimate"),2)))
ttest <- as.data.frame(cbind(et,p.value))
datatable(ttest, autoHideNavigation="TRUE", options = list(columnDefs = list(list(className = 'dt-center', targets = "_all"))))
Principal Component Analysis (PCA) is particularly useful when working with “wide” data sets. In datasets with many variables, it is often difficult to plot the data in its raw format, making it difficult to determine the trends present within the dataset. PCA enables visualization of the “shape” of the data, identifying which samples are similar to one another and which are very different. This can enable identification of groups of samples that are similar and determine which variables make one group different from another. DataCamp
biopsy.pr <- prcomp(biopsy2, scale = T, center = T)
Applying prcomp function (from R-Base Package) to execute principal component analysis after scaling and centering the features The prinicipal components are stored as an object biopsy.pr.
PCA is a type of linear transformation on a given data set that has values for a certain number of variables (coordinates) for a certain amount of spaces. This linear transformation fits this dataset to a new coordinate system in such a way that the most significant variance is found on the first coordinate, and each subsequent coordinate is orthogonal to the last and has a lesser variance. In this way, you transform a set of x correlated variables over y samples to a set of p uncorrelated principal components over the same samples. DataCamp
summary(biopsy.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
## Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
## Cumulative Proportion 0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
## PC8 PC9
## Standard deviation 0.51062 0.29729
## Proportion of Variance 0.02897 0.00982
## Cumulative Proportion 0.99018 1.00000
There are nine prinicipal comoponents of which the first component (PC1) itself explains about 65% of the variability in the data, as shown by Proportion of Variance = 0.65; and the first two components (PC1, PC2) explain about 74% (as shown by the Cumulative proportion = 0.74 under PC2) of the variability. Thus, variability explained by all nine features can be explained by values of PC1 and PC2 only (dimensionality reduction).
library(ggbiplot)
biplot<- ggbiplot(biopsy.pr, pc.biplot = TRUE, scale= TRUE, obs.scale = 1, groups= biopsy1$class, labels = diagnosis, ellipse = TRUE, ellipse.prob = 0.9, alpha = 0.5, var.axes = TRUE, var.scale = 1, circle = TRUE)
ggplotly(biplot)
Biplot of PC1 and PC2
The
correlation circlevisualizes the correlation between the first two principal components and the 9 dataset features. All the 8 features (V1-V8) are aligned close together and parallel to PC1 axis. Only one feature,V9, is aligned orthogonal to others and parallel to PC2. Thus,PC1alone explains most of the variability explained by all the 8 features (V1-V8) and combined withPC2, can explain all the variability in teh data and differentiate between benign and malignant tumors.
library(tidyverse)
plot(biopsy.pr$x[, c(1, 2)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC2")
Scatter plot observations by components 1 and 2
# Repeat for components 1 and 3
plot(biopsy.pr$x[, c(1, 3)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC3")
Scatter plot observations by components 1 and 3
# Do additional data exploration of your choosing below (optional)
plot(biopsy.pr$x[, c(2, 3)], col = (diagnosis + 1),
xlab = "PC2", ylab = "PC3")
Scatter plot observations by components 2 and 3
par(mfrow = c(1, 2))
# Calculate variability of each component: pr.var
pr.var <- biopsy.pr$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Variance explained by each component and cumulative variance by all components
The first two principal component explain most of the variability in the data
# Scale the biopsy2 data: data.scaled
data.scaled <- scale(biopsy2)
head(data.scaled)
## V1 V2 V3 V4 V5 V6
## 1000025 0.1977598 -0.7016978 -0.7412304 -0.63889730 -0.5552016 -0.6983413
## 1002945 0.1977598 0.2770488 0.2625905 0.75747664 1.6939247 1.7715689
## 1015425 -0.5112687 -0.7016978 -0.7412304 -0.63889730 -0.5552016 -0.4239068
## 1016277 0.5522740 1.5820442 1.6010185 -0.63889730 -0.1053763 0.1249621
## 1017023 -0.1567545 -0.7016978 -0.7412304 0.05928967 -0.5552016 -0.6983413
## 1017122 1.2613024 2.2345419 2.2702324 1.80475710 1.6939247 1.7715689
## V7 V8 V9
## 1000025 -0.181694 -0.6124785 -0.3481446
## 1002945 -0.181694 -0.2848960 -0.3481446
## 1015425 -0.181694 -0.6124785 -0.3481446
## 1016277 -0.181694 1.3530163 -0.3481446
## 1017023 -0.181694 -0.6124785 -0.3481446
## 1017122 2.267589 1.3530163 -0.3481446
Scaling feature values before clustering process: Feature values from each row are represented as coordinates in n-dimensional space (n is the number of features) and then the distances between these coordinates are calculated. If these coordinates are not normalized, then it may lead to false results. Ref: Hierarchical Clustering in R
Euclidean distance is used as an input for the clustering algorithm. The proximity matrix containing the distance between each point is determined using a distance function.
# Calculate similarity as Euclidean distance between observations
data.dist <- dist(data.scaled, method = "euclidean")
Calculated (Euclidean) distance is stored as an object data.dist
biopsy.hclust <- hclust(data.dist, method = "complete")
biopsy.hclust2 <- hclust(data.dist, method = "mcquitty")
Create a hierarchical clustering model using
hclustfunction and two separate methods (i.e.completeandmcquitty); both models are stored as objects:biopsy.hclustandbiopsy.hclust2
Details on hclust
plot(biopsy.hclust)
Results of hierarchical clustering
plot(biopsy.hclust2)
Results of hierarchical clustering
Cutting the height at 9 will give 2 clusters
# Cut by number of clusters k
plot(biopsy.hclust)
biopsy.hclust.clusters <- cutree(biopsy.hclust, k = 2)
rect.hclust(biopsy.hclust, k=2, border="red")
Results of hierarchical clustering
plot(biopsy.hclust2)
biopsy.hclust2.clusters <- cutree(biopsy.hclust2, k = 2)
rect.hclust(biopsy.hclust2, k=2, border="red")
Results of hierarchical clustering
Using
cutree()onbiopsy.hclust, assigns cluster membership to each observation. Assumed two clusters and assigned the result to a vector calledbiopsy.hclust.clusters.
# Clusters using 'complete' method
table(biopsy.hclust.clusters)
## biopsy.hclust.clusters
## 1 2
## 541 142
thc <- table(biopsy.hclust.clusters, biopsy1$class)
thc
##
## biopsy.hclust.clusters benign malignant
## 1 442 99
## 2 2 140
# Clusters using 'mcquitty' method
table(biopsy.hclust2.clusters)
## biopsy.hclust2.clusters
## 1 2
## 450 233
thc2 <- table(biopsy.hclust2.clusters, biopsy1$class)
thc2
##
## biopsy.hclust2.clusters benign malignant
## 1 435 15
## 2 9 224
Compare cluster membership to actual diagnoses based on
completeandmcquittymethod ofhclust
sum(apply(table(biopsy.hclust.clusters, diagnosis), 1, min))
## [1] 101
sum(apply(table(biopsy.hclust2.clusters, diagnosis), 1, min))
## [1] 24
- Count out of place observations based on cluster by summing the row minimums
Based on “complete” h-clustering method, 101 tumors do not agree with the actual diagnosis Based on “mcquitty” h-clustering method, 24 tumors do not agree with the actual diagnosis
complete method
torg<-table(biopsy1$class)
biop <- c("benign", "malignant")
actual <- factor(rep(biop, times = c(torg[1], torg[2])), levels = rev(biop))
predhc <- factor(
c(
rep(biop, times = c(thc[1], thc[2])),
rep(biop, times = c(thc[3], thc[4]))),
levels = rev(biop))
xtab.hclust <- table(predhc, actual)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
cmhclust <- confusionMatrix(xtab.hclust)
cmhclust$overall['Accuracy']
## Accuracy
## 0.852123
Accuracy H-Clust
completemethod is 0.85
mcquitty method
biop <- c("benign", "malignant")
actual <- factor(rep(biop, times = c(torg[1], torg[2])), levels = rev(biop))
predhc2 <- factor(
c(
rep(biop, times = c(thc2[1], thc2[2])),
rep(biop, times = c(thc2[3], thc2[4]))),
levels = rev(biop))
xtab.hclust2 <- table(predhc2, actual)
library(caret)
cmhclust2 <- confusionMatrix(xtab.hclust2)
cmhclust2$overall['Accuracy']
## Accuracy
## 0.9648609
Accuracy H-Clust
mcquittymethod is 0.96
The data are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized.
# Initialize total within sum of squares error: wss
wss <- 0
# Look over 1 to 15 possible clusters
for (i in 1:15) {
# Fit the model: km.out
km.out <- kmeans(biopsy2, centers = i, nstart = 20, iter.max = 50)
# Save the within cluster sum of squares
wss[i] <- km.out$tot.withinss
}
# Produce a scree plot
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
Fitting a k-means model to the data using 2 centers and run the k-means algorithm 20 times. The result will be stored in biopsy.km
set.seed(4)
# Select number of clusters
k <- 2
biopsy.km <- kmeans(scale(biopsy2), centers = 2, nstart = 20, iter.max = 10, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
KM Cluster Model created using
kmeansfunction while applying scaling on features, and stored as an objectbiopsy.km
The cluster membership of the biopsy.km model object is contained in its cluster component and is accessed with the $ operator.
clusplot(biopsy2, biopsy.km$cluster, main='2D representation of the Cluster solution',
color=TRUE, shade=TRUE, labels=2, lines=0)
Results of K-Means clustering
table(biopsy.km$cluster)
##
## 1 2
## 230 453
tkmc <- table(biopsy.km$cluster, biopsy1$class)
tkmc
##
## benign malignant
## 1 10 220
## 2 434 19
Compare cluster membership to actual diagnoses based on K-Means clustering Based on K-Means Clustering, two clusters of 453 and 230 samples are created. In the former group of 453, the actual number of benign samples are 434 and malignant samples are 19. Of the 230 samples in the second cluster, there are 10 benign samples and 220 malignant samples
sum(apply(table(biopsy.km$cluster, diagnosis), 1, min))
## [1] 29
Number of Counts out of place observations based on cluster by summing the row minimums Based on the K-Means clustering, 29 tumors do not agree with the actual diagnosis
torg <-table(biopsy1$class)
biop <- c("benign", "malignant")
actual <- factor(rep(biop, times = c(torg[1], torg[2])), levels = rev(biop))
predkm <- factor(
c(
rep(biop, times = c(tkmc[2], tkmc[1])),
rep(biop, times = c(tkmc[4], tkmc[3]))),
levels = rev(biop))
xtab.kmeans <- table(predkm, actual)
library(caret)
cmkmeans<-confusionMatrix(xtab.kmeans)
cmkmeans$overall['Accuracy']
## Accuracy
## 0.9575403
Accuracy of K-Means clustering method is 0.957
summary(biopsy.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
## Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
## Cumulative Proportion 0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
## PC8 PC9
## Standard deviation 0.51062 0.29729
## Proportion of Variance 0.02897 0.00982
## Cumulative Proportion 0.99018 1.00000
biopsy.pr.hclust <- hclust(dist(biopsy.pr$x[, 1:9]), method = "complete")
biopsy.pr.hclust2 <- hclust(dist(biopsy.pr$x[, 1:9]), method = "mcquitty")
Create a hierarchical clustering model and stored as object
biopsy.pr.hclust
biopsy.pr.hclust.clusters <- cutree(biopsy.pr.hclust, k = 2)
biopsy.pr.hclust2.clusters <- cutree(biopsy.pr.hclust2, k = 2)
Cut model into 2 clusters and stored as an object
biopsy.pr.hclust.clusters
# Compare to actual diagnoses
table(biopsy.pr.hclust.clusters)
## biopsy.pr.hclust.clusters
## 1 2
## 544 139
tpc <- table(biopsy.pr.hclust.clusters, biopsy1$class)
tpc
##
## biopsy.pr.hclust.clusters benign malignant
## 1 442 102
## 2 2 137
sum(apply(tpc, 1, min))
## [1] 104
104 observations were not clustered accurately using the hierarchical clustering of principal components
# Compare to actual diagnoses
table(biopsy.pr.hclust2.clusters)
## biopsy.pr.hclust2.clusters
## 1 2
## 450 233
tpc2 <- table(biopsy.pr.hclust2.clusters, biopsy1$class)
tpc2
##
## biopsy.pr.hclust2.clusters benign malignant
## 1 435 15
## 2 9 224
sum(apply(tpc2, 1, min))
## [1] 24
24 observations were not clustered accurately using the hierarchical clustering of principal components
torg<-table(biopsy1$class)
biop <- c("benign", "malignant")
actual <- factor(rep(biop, times = c(torg[1], torg[2])), levels = rev(biop))
predpc <- factor(
c(
rep(biop, times = c(tpc[1], tpc[2])),
rep(biop, times = c(tpc[3], tpc[4]))),
levels = rev(biop))
xtab.pc <- table(predpc, actual)
library(caret)
cmpc<-confusionMatrix(xtab.pc)
cmpc$overall['Accuracy']
## Accuracy
## 0.8477306
torg<-table(biopsy1$class)
biop <- c("benign", "malignant")
actual <- factor(rep(biop, times = c(torg[1], torg[2])), levels = rev(biop))
predpc2 <- factor(
c(
rep(biop, times = c(tpc2[1], tpc2[2])),
rep(biop, times = c(tpc2[3], tpc2[4]))),
levels = rev(biop))
xtab.pc2 <- table(predpc2, actual)
library(caret)
cmpc2<-confusionMatrix(xtab.pc2)
cmpc2$overall['Accuracy']
## Accuracy
## 0.9648609
Clustering results are evaluated based on some externally known result, such as externally provided class labels i.e., benign/malignant for this dataset.
cluster_models <- as.data.frame(list(
'K Means' = round(cmkmeans$overall, 3),
'H Clust complete' = round(cmhclust$overall, 3),
'H Clust mcquitty' = round(cmhclust2$overall, 3),
'Pr.Comp HClust.comp' = round(cmpc$overall, 3),
'Pr.Comp HClust.mcquitty' = round(cmpc2$overall, 3)
))
datatable(t(cluster_models))
mcquitty method was the most accurate model followed by K-Means clustering for clustering benign and malignant breast tumor samples.The clustering result is evaluated based on the data clustered itself (internal information) without reference to external information. Internal validation measures reflect often the compactness, the connectedness and separation of the cluster partitions. Measures include: Dunn Index \((=min. separation/max. dia)\), Average Silhouette Width, Separation Index
cshc1 <- cluster.stats(data.dist, biopsy.hclust.clusters)
cshc2<-cluster.stats(data.dist, biopsy.hclust2.clusters)
cskm <- cluster.stats(data.dist, biopsy.km$cluster)
DI<-as.data.frame(list(cshc1$dunn, cshc2$dunn, cskm$dunn))
IntVal <- data.frame("Dunn-Index" = c(cshc1$dunn, cshc2$dunn, cskm$dunn),
"Silhouette-Width" = c(cshc1$avg.silwidth, cshc2$avg.silwidth, cskm$avg.silwidth),
"Separation-Index" = c(cshc1$sindex, cshc2$sindex, cskm$sindex ))
row.names(IntVal) <- c("Hierarchical Clustering 'complete'", "Hierarchical Clustering 'mcquitty'", "K-Means")
datatable(IntVal, style="default", height = "auto", width = "auto")
mcquitty method provides the most accurate clustering of breast tumor samples. Inaccuracies can be addressed by additional methods of assessment, including clinical judgement, and biomarker assays, to confirm or rule out malignant tumors and vice versa.Champion Model: hierarchical clustering using mcquitty method
Created in R Markdown by Umesh Singh
singhuh@ucmail.uc.edu